Step 1: Speech-to-Text

Initially, we want to create a simple speech-to-text Python file using OpenAI Whisper.

You can test with the sample audio file provided via the download link.

Create and open a Python file named simple_speech2text.py.

Let's download the audio file first (alternatively, you can download it manually and drag and drop it into the file environment).

```python
import requests

# URL of the audio file to be downloaded
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX04C6EN/Testing%20speech%20to%20text.mp3"

# Send a GET request to the URL to download the file
response = requests.get(url)

# Define the local file path where the audio file will be saved
audio_file_path = "downloaded_audio.mp3"

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # If successful, write the content to the specified local file path
    with open(audio_file_path, "wb") as file:
        file.write(response.content)
    print("File downloaded successfully")
else:
    # If the request failed, print an error message
    print("Failed to download the file")
```

Run the Python file to test it.

```shell
python3 simple_speech2text.py
```

You should see the downloaded audio file in the file explorer.
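If you prefer to verify programmatically rather than through the file explorer, a small helper like the one below can confirm the download. This is a sketch, not part of the lab code; the `check_download` function name is our own:

```python
from pathlib import Path

def check_download(path: str) -> bool:
    """Return True if the file at `path` exists and is non-empty."""
    p = Path(path)
    return p.exists() and p.stat().st_size > 0

# Check the file saved by the download script
if check_download("downloaded_audio.mp3"):
    print("File looks good")
else:
    print("File missing or empty")
```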


Next, implement OpenAI Whisper to transcribe speech to text.

You can overwrite the previous code in the Python file.

```python
import torch
from transformers import pipeline

# Initialize the speech-to-text pipeline from Hugging Face Transformers
# This uses the "openai/whisper-tiny.en" model for automatic speech recognition (ASR)
# The `chunk_length_s` parameter specifies the chunk length in seconds for processing
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny.en",
    chunk_length_s=30,
)

# Define the path to the audio file that needs to be transcribed
sample = "downloaded_audio.mp3"

# Perform speech recognition on the audio file
# The `batch_size=8` parameter indicates how many chunks are processed at a time
# The result is stored in `prediction` with the key "text" containing the transcribed text
prediction = pipe(sample, batch_size=8)["text"]

# Print the transcribed text to the console
print(prediction)
```

Run the Python file again to see the transcription output.

```shell
python3 simple_speech2text.py
```
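Before moving on, it can help to wrap the transcription call in a reusable function, since an interface will need to call it repeatedly. This is a sketch; the `transcribe_file` name is our own, and `asr_pipe` stands for the Whisper pipeline created above:

```python
def transcribe_file(asr_pipe, audio_path: str, batch_size: int = 8) -> str:
    """Run an ASR pipeline on an audio file and return the transcript text."""
    # The pipeline returns a dict with the transcript under the "text" key
    return asr_pipe(audio_path, batch_size=batch_size)["text"]

# Usage with the Whisper pipeline created earlier:
# text = transcribe_file(pipe, "downloaded_audio.mp3")
```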

In the next step, we will use Gradio to create an interface for our app.